Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks
Authors
Abstract
We study the stochastic multi-armed bandit (MAB) problem in the presence of side-observations across actions that occur as a result of an underlying network structure. In our model, a bipartite graph captures the relationship between actions and a common set of unknowns, such that choosing an action reveals observations for the unknowns that it is connected to. This models a common scenario in online social networks where users respond to their friends’ activity, thus providing side information about each other’s preferences. Our contributions are as follows: 1) We derive an asymptotic lower bound (with respect to time) on the regret of any uniformly good policy that achieves the maximum long-term average reward, as a function of the bipartite network structure. 2) We propose two policies: a randomized policy, and a policy based on the well-known upper confidence bound (UCB) policies, both of which explore each action at a rate that is a function of its network position. We show, under mild assumptions, that these policies achieve the asymptotic lower bound on the regret up to a multiplicative factor that is independent of the network structure. Finally, we use numerical examples on a real-world social network and a routing example network to demonstrate the benefits obtained by our policies over other existing policies.
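To make the model concrete, the following is a minimal toy sketch (not the paper's exact policy) of a UCB1-style bandit with side-observations: a bipartite graph connects actions to unknowns, and choosing an action reveals a sample of every unknown it is connected to. The graph, reward model (mean of revealed samples), and all names below are illustrative assumptions.

```python
import math
import random

# Illustrative sketch only, not the authors' policy. Actions 0-2 are each
# connected to a subset of 4 unknowns; pulling an action reveals a Bernoulli
# sample of every connected unknown (the "side-observations").
edges = {0: [0, 1], 1: [1, 2], 2: [2, 3]}   # action -> connected unknowns
true_means = [0.2, 0.8, 0.6, 0.3]           # Bernoulli mean of each unknown

counts = [0] * 4    # number of observations per unknown
sums = [0.0] * 4    # running sum of samples per unknown

def ucb_index(action, t):
    """Optimistic index for an action, averaging its unknowns' UCB1 values."""
    vals = []
    for u in edges[action]:
        if counts[u] == 0:
            return float("inf")  # force exploration of any unseen unknown
        mean = sums[u] / counts[u]
        bonus = math.sqrt(2.0 * math.log(t) / counts[u])
        vals.append(mean + bonus)
    return sum(vals) / len(vals)

random.seed(0)
horizon = 5000
total_reward = 0.0
for t in range(1, horizon + 1):
    act = max(edges, key=lambda a: ucb_index(a, t))
    # Choosing `act` reveals one sample of every connected unknown,
    # so statistics accumulate even for unknowns shared with other actions.
    samples = {u: float(random.random() < true_means[u]) for u in edges[act]}
    for u, x in samples.items():
        counts[u] += 1
        sums[u] += x
    # Reward model (an assumption): mean of the revealed samples.
    total_reward += sum(samples.values()) / len(samples)

# Empirically best action after the run.
best = max(edges, key=lambda a: sum(sums[u] / counts[u] for u in edges[a]) / len(edges[a]))
print(best)
```

Because unknowns are shared across actions (unknown 1 is observed by both actions 0 and 1), every unknown accumulates observations even when its actions are rarely chosen; this is the mechanism that lets exploration rates depend on network position.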
Similar Works
Reward-Rate Maximization in Sequential Identification under a Stochastic Deadline
Abstract. Any intelligent system performing evidence-based decision making under time pressure must negotiate a speed-accuracy trade-off. In computer science and engineering, this is typically modeled as minimizing a Bayes-risk functional that is a linear combination of expected decision delay and expected terminal decision loss. In neuroscience and psychology, however, it is often modeled as m...
Expectation Maximization for Average Reward Decentralized POMDPs
Planning for multiple agents under uncertainty is often based on decentralized partially observable Markov decision processes (Dec-POMDPs), but current methods must de-emphasize long-term effects of actions by a discount factor. In tasks like wireless networking, agents are evaluated by average performance over time, both short- and long-term effects of actions are crucial, and discounting based s...
Optimal Temporal Risk Assessment
Time is an essential feature of most decisions, because the reward earned from decisions frequently depends on the temporal statistics of the environment (e.g., on whether decisions must be made under deadlines). Accordingly, evolution appears to have favored a mechanism that predicts intervals in the seconds to minutes range with high accuracy on average, but significant variability from trial...
Dynamics of betting behavior under flat reward condition
One of the missions of the cognitive process of animals, including humans, is to make reasonable judgments and decisions in the presence of uncertainty. The balance between exploration and exploitation investigated in the reinforcement-learning paradigm is one of the key factors in this process. Recently, following the pioneering work in behavioral economics, growing attention has been directed...